In this example, we are going to learn about an alternative method to encode text data known as word embeddings. This is an incomplete tutorial on word embeddings but will at least give you the basic understanding on when and why we use them.
Learning objectives:
# Initialize package
library(keras)
library(fs)
library(tidyverse)
library(glue)
library(progress)
# helper functions we'll use to explore word embeddings
source("helper_functions.R")
So far, we’ve been using the built-in IMBD dataset. Here, we are going to use the original data files which can be found at http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. You can find the download instructions here. For those in the workshop we have already downloaded this data for you.
imdb_dir <- here::here("materials", "data", "imdb")
fs::dir_tree(imdb_dir, type = "directory")
[01;34m/Users/b294776/Desktop/Workspace/Training/rstudio-conf-2020/dl-keras-tf/materials/data/imdb[0m
├── [01;34mtest[0m
│ ├── [01;34mneg[0m
│ └── [01;34mpos[0m
└── [01;34mtrain[0m
├── [01;34mneg[0m
└── [01;34mpos[0m
You can see the data have already been separated into test vs training sets and positive vs negative sets. The actual reviews are contained in individual .txt files. We can use this structure to our advantage - the below iterates over each review and
training_files <- file.path(imdb_dir, "train") %>%
dir_ls() %>%
map(dir_ls) %>%
set_names(basename) %>%
plyr::ldply(data_frame) %>%
set_names(c("label", "path"))
training_files
We can see our response observations are balanced:
count(training_files, label)
We can now iterate over each row and
obs <- nrow(training_files)
labels <- vector(mode = "integer", length = obs)
texts <- vector(mode = "character", length = obs)
# this just allows us to track progress of our loop
pb <- progress_bar$new(total = obs, width = 60)
for (file in seq_len(obs)) {
pb$tick()
label <- training_files[[file, "label"]]
path <- training_files[[file, "path"]]
labels[file] <- ifelse(label == "neg", 0, 1)
texts[file] <- readChar(path, nchars = file.size(path))
}
We now have two vectors, one consisting of the labels and…
table(labels)
labels
0 1
12500 12500
the other holding each review.
texts[1]
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
A little exploratory analysis will show us the total number of unique words across our corpus and the average length of each review.
text_df <- texts %>%
tibble(.name_repair = ~ "text") %>%
mutate(text_length = str_count(text, "\\w+"))
unique_words <- text_df %>%
tidytext::unnest_tokens(word, text) %>%
pull(word) %>%
n_distinct()
avg_review_length <- median(text_df$text_length, na.rm = TRUE)
ggplot(text_df, aes(text_length)) +
geom_histogram(bins = 100, fill = "grey70", color = "grey40") +
geom_vline(xintercept = avg_review_length, color = "red", lty = "dashed") +
scale_x_log10("# words") +
ggtitle(glue("Median review length is {avg_review_length} words"),
subtitle = glue("Total number of unique words is {unique_words}"))
Word embeddings are designed to encode general semantic relationships which can serve two principle purposes. The first is for language modeling which aims to encode words for the purpose of predicting synonyms, sentence completion, and word relationships.
See slides for more discussion of this type of modeling. We are not focusing on word embeddings for this purpose; however, I have written a couple helper functions to train word embeddings for this purpose. See the code behind these helper functions here.
# clean up text and compute word embeddings
clean_text <- tolower(texts) %>%
str_replace_all(pattern = "[[:punct:] ]+", replacement = " ") %>%
str_trim()
word_embeddings <- get_embeddings(clean_text)
Creating vocabulary...
Creating term-co-occurence matrix...
Computing embeddings based on GloVe algorithm...
INFO [2019-12-09 09:14:04] 2019-12-09 09:14:04 - epoch 1, expected cost 0.0821
INFO [2019-12-09 09:14:05] 2019-12-09 09:14:05 - epoch 2, expected cost 0.0555
INFO [2019-12-09 09:14:06] 2019-12-09 09:14:06 - epoch 3, expected cost 0.0485
INFO [2019-12-09 09:14:07] 2019-12-09 09:14:07 - epoch 4, expected cost 0.0443
INFO [2019-12-09 09:14:08] 2019-12-09 09:14:08 - epoch 5, expected cost 0.0415
INFO [2019-12-09 09:14:09] 2019-12-09 09:14:09 - epoch 6, expected cost 0.0395
INFO [2019-12-09 09:14:11] 2019-12-09 09:14:11 - epoch 7, expected cost 0.0379
INFO [2019-12-09 09:14:12] 2019-12-09 09:14:12 - epoch 8, expected cost 0.0367
INFO [2019-12-09 09:14:13] 2019-12-09 09:14:13 - epoch 9, expected cost 0.0357
INFO [2019-12-09 09:14:14] 2019-12-09 09:14:14 - epoch 10, expected cost 0.0348
INFO [2019-12-09 09:14:15] 2019-12-09 09:14:15 - epoch 11, expected cost 0.0341
INFO [2019-12-09 09:14:16] 2019-12-09 09:14:16 - epoch 12, expected cost 0.0335
INFO [2019-12-09 09:14:17] 2019-12-09 09:14:17 - epoch 13, expected cost 0.0330
INFO [2019-12-09 09:14:18] 2019-12-09 09:14:18 - epoch 14, expected cost 0.0326
INFO [2019-12-09 09:14:19] 2019-12-09 09:14:19 - epoch 15, expected cost 0.0322
INFO [2019-12-09 09:14:20] 2019-12-09 09:14:20 - epoch 16, expected cost 0.0318
INFO [2019-12-09 09:14:22] 2019-12-09 09:14:22 - epoch 17, expected cost 0.0315
INFO [2019-12-09 09:14:23] 2019-12-09 09:14:23 - epoch 18, expected cost 0.0312
INFO [2019-12-09 09:14:23] Success: early stopping. Improvement at iterartion 18 is less then convergence_tol
Explore your own words!
# find words with similar embeddings
get_similar_words("horrible", word_embeddings)
horrible terrible awful bad acting
1.0000000 0.9132471 0.8663343 0.8041792 0.7790165
The other principle purpose for word embeddings is to encode text for classification reasons. In this case, we train the word embeddings to take on weights that optimize the classification loss function.
See slides for more discussion of this type of modeling.
To prepare our data we need to convert or labels vector to a tensor:
labels <- as.array(labels)
But more importantly, we need to preprocess our text features. To do so we:
text_tokenizer object which defines how we want to preprocess the text (i.e. convert to lowercase, remove punctuation, token splitting characters). For the most part, the defaults are sufficient.fit_text_tokenizer. This results in an object with many details of our corpus (i.e. word counts, word index).top_n_words <- 10000
tokenizer <- text_tokenizer(num_words = top_n_words) %>%
fit_text_tokenizer(texts)
names(tokenizer)
[1] "char_level" "document_count"
[3] "filters" "fit_on_sequences"
[5] "fit_on_texts" "get_config"
[7] "index_docs" "index_word"
[9] "lower" "num_words"
[11] "oov_token" "sequences_to_matrix"
[13] "sequences_to_texts" "sequences_to_texts_generator"
[15] "split" "texts_to_matrix"
[17] "texts_to_sequences" "texts_to_sequences_generator"
[19] "to_json" "word_counts"
[21] "word_docs" "word_index"
total_word_index <- tokenizer$word_index
num_words_used <- tokenizer$num_words
glue("We have now tokenized our reviews. ", "We are considering {num_words_used} ",
"of {length(total_word_index)} total unique words. The most common words ",
"include:")
We have now tokenized our reviews. We are considering 10000 of 88582 total unique words. The most common words include:
head(total_word_index)
$the
[1] 1
$and
[1] 2
$a
[1] 3
$of
[1] 4
$to
[1] 5
$is
[1] 6
Next, we extract our vectorized review data as a list. This looks familiar from the earlier modules.
sequences <- texts_to_sequences(tokenizer, texts)
# The vectorized first instance:
sequences[[1]]
[1] 62 4 3 129 34 44 7576 1414 15 3 4252 514 43 16 3 633 133
[18] 12 6 3 1301 459 4 1751 209 3 7693 308 6 676 80 32 2137 1110
[35] 3008 31 1 929 4 42 5120 469 9 2665 1751 1 223 55 16 54 828
[52] 1318 847 228 9 40 96 122 1484 57 145 36 1 996 141 27 676 122
[69] 1 411 59 94 2278 303 772 5 3 837 20 3 1755 646 42 125 71
[86] 22 235 101 16 46 49 624 31 702 84 702 378 3493 2 8422 67 27
[103] 107 3348
We can see how our tokenizer converted our original text to a cleaned up version:
cat("Original text:\n")
Original text:
texts[[1]]
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
cat("\nRevised text:\n")
Revised text:
paste(unlist(tokenizer$index_word)[sequences[[1]]] , collapse = " ")
[1] "story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the dialogue would make shakespeare seem easy to a third on a technical level it's better than you might think with some good cinematography by future great future stars sally and forrest can be seen briefly"
Next, since each review is a different length, we need to limit ourselves to a certain number of words so that all our features (reviews) are the same length.
Note (?pad_sequences): * Any reviews that are shorter than this length will be padded. * Any reviews that are longer than this length will be truncated.
max_len <- 150
features <- pad_sequences(sequences, maxlen = max_len)
features[1,]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[18] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[35] 0 0 0 0 0 0 0 0 0 0 0 0 62 4 3 129 34
[52] 44 7576 1414 15 3 4252 514 43 16 3 633 133 12 6 3 1301 459
[69] 4 1751 209 3 7693 308 6 676 80 32 2137 1110 3008 31 1 929 4
[86] 42 5120 469 9 2665 1751 1 223 55 16 54 828 1318 847 228 9 40
[103] 96 122 1484 57 145 36 1 996 141 27 676 122 1 411 59 94 2278
[120] 303 772 5 3 837 20 3 1755 646 42 125 71 22 235 101 16 46
[137] 49 624 31 702 84 702 378 3493 2 8422 67 27 107 3348
paste(unlist(tokenizer$index_word)[features[1,]], collapse = " ")
[1] "story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the dialogue would make shakespeare seem easy to a third on a technical level it's better than you might think with some good cinematography by future great future stars sally and forrest can be seen briefly"
Check out different reviews and see how we have transformed the data. Remove eval=FALSE to run.
# use review number (i.e. 2, 10, 150)
which_review <- ____
cat(crayon::blue("Original text:\n"))
texts[[which_review ]]
cat(crayon::blue("\nRevised text:\n"))
paste(unlist(tokenizer$index_word)[features[which_review ,]] , collapse = " ")
cat(crayon::blue("\nEncoded text:\n"))
features[which_review ,]
Our data is now preprocessed! We have 25000 observations and 150 features. Our features data is a matrix where each row is a single observation and each column represents the words in the review in the order that they appear.
dim(features)
[1] 25000 150
dim(labels)
[1] 25000
To train our model we will use the validation_split procedure within fit. Remember, this takes the last XX% of our data to be used as our validation set. But if you recall, our data was organized in neg and pos folders so we should randomize our data to make sure our validation set doesn’t end up being all positive or negative reviews!
set.seed(123)
index <- sample(1:nrow(features))
x_train <- features[index, ]
y_train <- labels[index]
To create our network architecture that includes word embeddings, need to include two things:
layer_embedding layer that creates the embeddings,layer_flatten to flatten our embeddings to a 2D tensor for our densely connected portion of our modelmodel <- keras_model_sequential() %>%
layer_embedding(
input_dim = top_n_words, # number of words we are considering
input_length = max_len, # length that we have set each review to
output_dim = 32 # length of our word embeddings
) %>%
layer_flatten() %>%
layer_dense(units = 1, activation = "sigmoid")
summary(model)
Model: "sequential_3"
_____________________________________________________________________________________________
Layer (type) Output Shape Param #
=============================================================================================
embedding_1 (Embedding) (None, 150, 32) 320000
_____________________________________________________________________________________________
flatten_1 (Flatten) (None, 4800) 0
_____________________________________________________________________________________________
dense_1 (Dense) (None, 1) 4801
=============================================================================================
Total params: 324,801
Trainable params: 324,801
Non-trainable params: 0
_____________________________________________________________________________________________
The rest of our modeling procedures follows the same protocols that you’ve seen in the other modules.
model %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = "accuracy"
)
history <- model %>% fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)
glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
Our optimal loss is 0.3 with an accuracy of 0.874
plot(history)
Spend a few minutes adjusting this model and see how it impact performance. You may want to test:
output_dim) impacts performance?yourturn_model <- keras_model_sequential() %>%
layer_embedding(
input_dim = _____,
input_length = _____,
output_dim = _____
) %>%
layer_flatten() %>%
layer_dense(units = ____, activation = ____) %>%
layer_dense(units = 1, activation = "sigmoid")
yourturn_model %>% compile(
optimizer = _____,
loss = "binary_crossentropy",
metrics = "accuracy"
)
yourturn_results <- yourturn_model %>% fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
Recall that the word embeddings we found for natural language modeling created results like:
# natural language modeling embeddings
get_similar_words("horrible", word_embeddings)
horrible terrible awful bad acting
1.0000000 0.9132471 0.8663343 0.8041792 0.7790165
However, embeddings we find for classification tasks are not always so clean and intuitive. We can get the word embeddings from our classification model with:
wts <- get_weights(model)
embedding_wts <- wts[[1]]
The following just does some bookkeeping to extract the applicable words and assign them as row names to the embedding matrix.
words <- tokenizer$word_index %>%
as_tibble() %>%
pivot_longer(everything(), names_to = "word", values_to = "id") %>%
filter(id <= tokenizer$num_words) %>%
arrange(id)
row.names(embedding_wts) <- words$word
The following is one of the custom functions you imported from the helper_functions.R file. You can see the word embeddings that most closely align to a given word are not as intuitive as those produced from the natural language model. However, these are the embeddings that optimized for the classification procedure at hand.
similar_classification_words("horrible", embedding_wts)
horrible foul source rivals homicide fits
1.0000000 0.7439281 0.7299364 0.7215308 0.7141167 0.7128572
Here’s a handy sequence of code that uses the t-SNE methodology to visualize nearest neighbor word embeddings.
# plotting too many words makes the output hard to read
n_words_to_plot <- 1000
tsne <- Rtsne::Rtsne(
X = embedding_wts[1:n_words_to_plot,],
perplexity = 100,
pca = FALSE
)
p <- tsne$Y %>%
as.data.frame() %>%
mutate(word = row.names(embedding_wts)[1:n_words_to_plot]) %>%
ggplot(aes(x = V1, y = V2, label = word)) +
geom_text(size = 3)
plotly::ggplotly(p)